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Abstract: Importance sampling is a common technique for Monte Carlo 
approximation, including Monte Carlo approximation of p- values. Here it is 
shown that a simple correction of the usual importance sampling p-values 
creates valid p-values, meaning that a hypothesis test created by rejecting 
the null when the p- value is < o will also have a type I error rate < a. This 
correction uses the importance weight of the original observation, which 
gives valuable diagnostic information under the null hypothesis. Using the 
corrected p-values can be crucial for multiple testing and also in problems 
where evaluating the accuracy of importance sampling approximations is 
difficult. Inverting the corrected p-values provides a useful way to create 
Monte Carlo confidence intervals that maintain the nominal significance 
level and use only a single Monte Carlo sample. Several applications are 
described, including accelerated multiple testing for a large neurophysiolog- 
ical dataset and exact conditional inference for a logistic regression model 
with nuisance parameters. 
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1. Introduction 

Importance sampling is a common technique for Monte Carlo approximation, 
including Monte Carlo approximation of p-values. Besides its use in situations 
where efficient direct-sampling algorithms are unavailable, importance sampling 
can be used to accelerate the approximation of tiny p-values as needed for mul- 
tiple hypothesis testing. Importance sampling can also be used for Monte Carlo 
approximation of confidence intervals using a single Monte Carlo sample by in- 
verting a family of hypothesis tests. The practicality of these two uses of impor- 
tance sampling — accelerated multiple testing and Monte Carlo approximation 
of confidence intervals — would seem to be limited by the fact that each places 
strong requirements on the importance sampling procedure. Multiple testing 
controls can be sensitive to tiny absolute errors (but large relative errors) in 
small p-values, which would seem to demand either excessively large Monte 
Carlo sample sizes or unrealistically accurate importance sampling proposal 
distributions in order to reduce absolute Monte Carlo approximation errors to 
tolerable levels. The confidence interval procedure uses a single proposal distri- 
bution to approximate probabilities under a large family of target distributions, 
which again would seem to demand large sample sizes in order to overcome the 
high importance sampling variability that is frequently encountered when the 
proposal distribution is not tailored to a specific target distribution. Neverthe- 
less, as shown here, simple corrections of the usual importance sampling p- value 
approximations can be used to overcome these difficulties, making importance 
sampling a practical choice for accelerated multiple testing and for constructing 
Monte Carlo confidence intervals. 

The p-value corrections that we introduce require negligible additional com- 
putation and still converge to the target p-value, but have the additional prop- 
erty that they are valid p-values, meaning the probability of rejecting a null 
hypothesis is < a for any specified level a. Hypothesis tests and confidence in- 
tervals constructed from the corrected p-value approximations are guaranteed to 
be conservative, regardless of the Monte Carlo sample size, while still behaving 
like the target tests and intervals for sufficiently large Monte Carlo sample sizes. 
The combination of being both conservative and consistent turns out to be cru- 
cial in many applications where the importance sampling variability cannot be 
adequately controlled with practical amounts of computing, including multiple 
testing, confidence intervals, and any hypothesis testing situation where the true 
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variability of the importance sampling algorithm is either large or unknown. We 
demonstrate the practical utility of the correction with several examples below. 

Let X denote the original observation (data set). Assume that the null hy- 
pothesis specifies a known distribution P for X. For a specified test statistic t, 
the goal is to compute the p-value p{X) defined by 

p{x) ^ PT{t{X) > t{x)) 

where the probability is computed under the null hypothesis. Importance sam- 
pling can be used to Monte Carlo approximate p{X) if the p-value cannot be 
determined analytically. Let Y = {Yi, . . . ,Yn) be an independent and identi- 
cally distributed (i.i.d.) sample from a distribution Q (the proposal distribution) 
whose support includes the support of P (the target distribution). Then p{X) 
can be approximated with either 

^(^^ A J:tMY^HtiY,)>tiX)} 
n 

. T:=MY^Hm)>t{x)} 

where 1 is the indicator function and where the importance weights are defined 
to be 

in the discrete case, and the ratio of densities in the continuous case. Each of 
these are consistent approximations of p{X) as the Monte Carlo sample size 
increases. The former is unbiased and is especially useful for approximating 
extremely small p-values. The latter can be evaluated even if the importance 
weights are only known up to a constant of proportionality. Note that n is the 
Monte Carlo sample size; the sample size or dimensionality of X is irrelevant 
for the developments here. The reader is referred to Liu (2001) for details and 
references concerning importance sampling and to Lehmann & Romano (2005) 
for hypothesis testing. 

Here we propose the following simple corrections of p and p that make use of 
the importance weight of the original observation, namely, 

^ A ^<X) + EL i w{Y,)t{t{Y,) > t{X)} 

P* 7 ^ ) 



P*iX,Y)^ 



1+71 

MX) + ELi wiY)t{t{Y,) > t{X)} 

w{x) + j:umyj) 



We will show that the corrected p-value approximations, while clearly still con- 
sistent approximations of the target p-value p{X), are also themselves valid 
p-values, meaning 

Pt{% {X, Y) <a) <a and Pr(p* (X, Y) < a) < a 
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for all a € [0, 1] and n > under the null hypothesis, where the probability is 
with respect to the joint distribution of data and Monte Carlo sample. These 
simple corrections have far-reaching consequences and enable importance sam- 
pling to be successfully used in a variety of situations where it would otherwise 
fail. Of special notes are the ability to properly control for multiple hypothesis 
tests and the ability to create valid confidence intervals using a single Monte 
Carlo sample. 

2. Main results 

The main results are that and p* are valid p- values (Theorems 1 and 2). 
We generalize the introductory discussion in two directions. First, we allow 
arbitrary distributions, so the importance weights become the Radon-Nikodym 
derivative dP/dQ. In the discrete case this simplifies to the ratio of probability 
mass functions, as in the introduction, and in the continuous case this simplifies 
to the ratio of probability density functions. Second, we allow the choice of 
test statistic to depend on (X, Yi, . . . as long as the choice is invariant to 
permutations of (X, Yi, . . . , y„). We express this mathematically by writing the 
test statistic, t, as a function of two arguments: the first is the same as before, 
but the second argument takes the entire sequence (X, Yi, . . . , 1^), although 
we require that t is invariant to permutations in the second argument. For 
example, we may want to transform the sequence (X, Yi, . . . , y„) in some way, 
either before or after applying a test-statistic to the individual entries. As long as 
the transformation procedure is permutation invariant (such as centering and 
scaling, or converting to ranks), everything is fine. Transformations are often 
desirable in multiple testing contexts for improving balance (Westfall & Young, 
1993). 

We begin with the precise notation and assumptions for the theorems. P and 
Q are probability distributions defined on the same measurable space (S*, S) with 
P ^ Q (meaning that sets with positive P probability also have positive Q prob- 
ability), and w is a fixed, nonnegative version of the Radon-Nikodym derivative 
dP/dQ. M denotes the set of all {n + 1)1 permutations tt = (ttq, . . . ,7r„) of 
(0, . . . , n). For tt € M and z = {zq, . . . , z„), we define z^'^^ = (z^^,, . . . , z^^^). 
Assume that t : S x 5*"+-^ i-)- [— oo, oo] has the property that 

t(a,z) = t(a,z^'^^) 

for all a e S, z e 5"+^, and tt e M. 
For z e S'"+i define 



n+1 

A Y.l=ow{zi)t{t{z,,z) >t{z^,z)} 



where we take 0/0 = 0. Let X have distribution P and let Yo,Yi, . . . ,Yn be 
an i.i.d. sample from Q, independent of X. For notational convenience define 
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Z= {Zo,Zi,...,Zn) by 

Zq — A, Z/i — li, . . . , An — In 

SO that the corrected p- values are %{Z) and Then, 
Theorem 1. Pr(p*(Z) < a) < a for all a G [0, 1]. 
Theorem 2. Pr(p*(Z) < a) < a for all a e [0, 1]. 

Proofs are in the Appendix. The theorems do not require any special rela- 
tionships among P, Q, t, or n. For example, in parametric settings Q does not 
need to be in the same model class as the null and/or alternative. Validity of 
the corrected p- values is ensured even for unusual cases such as n = or impor- 
tance weights with infinite variance. We discuss some practical considerations 
for choosing Q in the next section (presumably, P, t and n are dictated by the 
application and the computational budget). Validity of the corrected p- values 
is well known for the special case of direct sampling, i.e., Q = P and w = 1. 
For Markov chain Monte Carlo (MCMC) approaches, Bcsag & Chfford (1989) 
demonstrate how to generate valid p- value approximations using techniques that 
are unrelated to the ones here. 

The theorems continue to hold if each Yk is chosen from a different Q, say 
Qk, as long as the sequence Qo, Qi, . . . , Qn is itself i.i.d. from some distribution 
over proposal distributions. The fcth importance weight is now a fixed version 
of the Radon- Nikodym derivative dP/dQk- This generalization can be useful in 
practice when each Yk is generated hierarchically by first choosing an auxiliary 
variable Vk and then, given Vfc, choosing Y^. from some distribution Q^'' that 
depends on Vk- If the marginal distribution (i.e., Q) of Yfc is not available, one 
can use importance weights based on the conditional distribution (i.e., Q^'') 
of Yfc given Vk- This is equivalent to using random proposal distributions as 
described above. The drawback of combining this approach with the p-value 
corrections, is that the importance weight of the original observation is evalu- 
ated using dP/dQo, where Qo is a randomly chosen proposal distribution. This 
further increases the Monte Carlo randomness already inherent in the p-value 
approximations. Note that generalizing the theorems to allow for random pro- 
posals requires no additional work. If v is the joint distribution of each (Qk, Yk) 
and vq is the marginal distribution of each Q^., simply apply the theorems with 
z/Q X P in place of P, v in place of Q, (Qo:-^) in place of X, and (Qfe,!^) 
in place of Yk- If one uses a regular conditional distribution to define the im- 
portance weight \d(yQ x P) j dv\{Qk,x\ then it will simplify to a version of 
{dQkldP\(x). 

Finally, we recall the well known fact that any valid family of p-values can 
be inverted in the usual way to give valid confidence intervals. In particular, 
consider a collection of distributions {Pq : G 8} and, for each 6* S 0, let 
V*{Q, Z) be any vahd p-value for testing the null hypothesis of Pg. Fix a € [0, 1] 
and define the random set 

C"(Z) ^{ee Q:p,{e,Z) > a} 
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Then C"{Z) is a 1 — a confidence set for 9 because 



Pre(C"(Z) 3 9)= 1 - Pie{p,{9, Z) < a)> 1 - a 



wliere Pig is calculated under the null hypothesis of Pg. An appealing feature 
about inverting either p.^, or p.^, is that the same importance samples can (in 
principle) be used for testing each 9. Only the importance weights (and perhaps 
the test statistic) vary with 9. Section 4.2 below illustrates this application. 
The idea of using importance sampling to construct confidence intervals from 
a single Monte Carlo sample was pointed out in Green (1992). See Bolviken & 
Skovlund (1996) and Garthwaite & Buckland (1992) for examples of other ways 
to create Monte Carlo confidence intervals. 

3. Practical considerations 

It is clear that each of the corrected p- values has the same asymptotic behavior 
as its uncorrected counterpart, so for sufficiently large n they are essentially 
equivalent. But in practice, n will often be too small. The concern for the inves- 
tigator is that the corrected p-values may behave much worse than the uncor- 
rected ones for practical choices of n. In particular, the investigator is concerned 
about the following situation: the null hypothesis is false; the target p- value, p, is 
small; the uncorrected approximations, p or p, are also small; but the corrected 
approximations, p^ or p*, are large. Careful choice of Q can lower the chances 
of this situation. 

Each of the corrected p-values can be expressed as an interpolation between 
their respective uncorrected versions and another (non-Monte Carlo) valid p- 
value: 



where we are using the shorthand z = (x, i/i, . . . , j/„). In both cases, power 
suffers when X comes from a distribution in the alternative hypothesis, but 
w{X) is typically large relative to w{Yi), . . . ,w{Yn). Since w{Yk) always has 
mean 1, problems might arise for alternatives that tend to make w{X) much 
larger than 1. This problem can be avoided by choosing Q so that it gives more 
weight than does P to regions of the sample space that are more typical under 
alternatives than they are under P. The problem can also be avoided by choosing 
Q to be similar to P so that the weights are close to 1 throughout the sample 
space. Most proposal distributions are designed with one of these two goals in 
mind, so the corrected p-values should behave well for well-designed proposal 
distributions. In practice, however, proposal distributions can be quite bad, and 
it can be helpful to look more closely at how the proposal affects the power of 
the corrected p-values. 
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From (1) we see that p^, is an interpolation between p and w{x), the latter of 
which is a valid p-value. Validity of w{X) can be seen either by Theorem 1 for 
n = or by the simple calculation 

Pt{w{X) < a) = ^l{P(a;) < aQ{x)}P{x) < ^aQ(a;) = a (3) 

X X 

which easily generalizes to more abstract settings. Testing w(X) — P{X) /Q{X) < 
a is simply a likelihood ratio test (LRT) of P versus Q, although using the crit- 
ical value of a will give a test of size smaller than a. The test can be strongly 
conservative, because the bound in (3) can be far from tight. So is an in- 
terpolation between a conservative LRT of P versus Q and an (often liberal) 
importance sampling Monte Carlo approximation of the target p-value. If the 
effect of w{X) on has not disappeared (such as for small n), then is a 
p-value for the original null, P, versus a new alternative that is some amalga- 
mation of the original alternative and the proposal distribution Q. If Q is not in 
the alternative (as it often will not be), this modified alternative is important 
to keep in mind when interpreting any rejections. 

The effect of Q on interpreting rejections is especially important in multiple- 
testing situations. In these settings, importance sampling is often used to ac- 
celerate the approximation of tiny p-values via p. If p « 0, however, then 
p* « w{x) /(n + 1) and Q plays a critical role in determining which null hypothe- 
ses are rejected after correcting for multiple comparisons. The investigator can 
take advantage of this by ensuring that a rejection of P in favor of Q is sensible 
for the problems at hand, and by ensuring that events with small p-values will 
be heavily weighted by Q. 

Turning to (2), we see that is an interpolation between 1 and p. Since 
p < 1 , we always have 

P*>P 

so that always leads to a more conservative test than p. The degree of con- 
servativeness is controlled by the ratio w{x) / J^j "^(Vj)- If tti^ ratio is small, 
then the correction ensures validity with little loss of power. If it is large, there 
may be a substantial loss of power when using the correction. The ratio will 
be approximately 1/n if P and Q are similar, which is often the design criteria 
when using p. 

An important caveat for the main results is that Q is not allowed to depend 
on the observed value of X. This caveat precludes one of the classical uses of 
importance sampling, namely, using the observed value of t(X) to design a Q 
that heavily weights the event {x : t{x) > t{X)}. In many cases, however, since 
the functional form of t is known, one can a priori design a Q that will heavily 
weight the event {x : t{x) > t{X)} whenever the p-value would be small. For 
example, given a family of proposal distributions {Qi}e, each of which might be 
useful for a limited range of observed values of t{X), one can use finite mixture 
distributions of the form Q — X^f^i ^eQi to obtain more robust performance. 
For similar reasons, mixture distributions are also useful for creating Monte 
Carlo confidence intervals in which the same proposal distribution is used for a 

imsart-generic ver. 2008/08/29 file: ISpvalue.tex date: April 12, 2011 



M.T. Harrison/Conservative Hypothesis Tests and Confidence Intervals 



8 



family of target distributions. The examples in Sections 4.2 and 4.3 have more 
details, and Hesterberg (1995) contains a more general discussion of the utility 
of mixture distributions for importance sampling. Finally, we note that this 
caveat about Q does not preclude conditional inference. In conditional inference 
P is the null conditional distribution of X given some appropriate A{X), and 
all of the analysis takes place after conditioning on A{X). The choice of Q (and 
also t, for that matter) can thus depend on A{X), but not on additional details 
about X. 

4. Applications 

4-1- Accelerating multiple permutation tests 

Consider a collection of N datasets. The ith dataset X* = (V*,i*) contains a 
sample of values V — ( V7 , . . . , V^l^ . ) and corresponding labels = {L\ , . . . , . ) 
The distributions over values and labels are unknown and perhaps unrelated 
across datasets. We are interested in identifying which datasets show a depen- 
dence between the values and the labels. From a multiple hypothesis testing 
perspective, the ith null hypothesis is that and are independent, or more 
generally, that the values of L"^ are exchangeable conditioned on the values of 
V^. Given a test statistic t{x) = t{v,£), a permutation test p-value for the ith. 
null hypothesis is given by 



where the sum is over all permutations tt = (tti, . . . , TTm.) of (1, . . . , rrii), and 
where the notation £^'^^ = (f^^ , . . . , , ) denotes a permuted version of the 
elements of £ using the permutation tt. The beauty of the permutation test is 
that it converts a large composite null (independence of values and labels) into 
a simple null (all permutations of the labels are equally likely) by conditioning 
on the values and the labels but not their pairings. The ith null distribution, P\ 
becomes the uniform distribution over permutations (of the labels). The Bonfer- 
roni correction can be used to control the family- wise error rate (FWER), i.e., 
the probability of even a single false rejection among all N datasets (Lehmann & 
Romano, 2005). In particular, rejecting null hypothesis i whenever p{X^) < a/N 
ensures that the probability of even a single false rejection is no more than a. 
Although computing p exactly is often prohibitive, Monte Carlo samples from 
P* are readily available for approximating p, and importance sampling can be 
used to accelerate the approximation of tiny p- values. Bonferroni is still sensible 
as long as the Monte Carlo approximate p-values are valid p-values. 

Here is a specific simulation example. Consider N = 10** independent data- 
sets, each with to; — m — 100 real-valued values and corresponding binary 
labels. In each case there are 40 labels of 1 and 60 labels of 0. In 9990 datasets, 
the labels and values are independent, in which case the values are i.i.d. standard 
Cauchy (i.e., with scale parameter 1). In 10 of the datasets, the labels and values 
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are dependent, in which case the values associated to label are i.i.d. standard 
Cauchy and the values associated to label 1 are i.i.d. standard Cauchy plus 2 
(i.e., a standard Cauchy shifted to the right by 2). (This particular example was 
chosen because standard tests like the the two-sample i-test and the Wilcoxon 
rank-sum test tend to perform poorly.) For the alternative hypothesis that label 
1 tends to have larger values than label 0, a sensible test statistic is the difference 
in medians between values associated to label 1 and values associated to label 
0, namely, 

t{x) = t{v,£) = median(wj : £j = 1) — median(t;j : £j — 0) 

Table 1 shows the Bonferroni-corrected performance of several different Monte 
Carlo approaches as the Monte Carlo sample size (n) increases. The approxima- 
tions p and p* refer to p and , respectively, for the special case of Q = P and 
w = 1, i.e., direct sampling (all permutations are equally likely). Using either p 
or p^ would be the standard method of approximating p. The approximations 
p and are based on a proposal distribution that prefers permutations that 
pair large values of t with label 1 (see the Appendix for details). In each case we 
reject when the approximate p- value is < a/N. The final approximation g refers 
to a Wald 1 — a/ (2N) upper confidence limit for p using the same importance 
samples to estimate a standard error. For q we reject whenever q < a/{2N), 
which, if the confidence limits were exact, would correctly control the FWER 
at level a. 

Table 1 

Bonferroni-corrected testing performance for different p-value approximations versus Monte 
Carlo sample size (n); Hq is false for 10 out of N = 10* tests 

# correct rejections # incorrect rejections 



n 


lOi 


102 


10^ 


10* 


lOi 


102 


10^ 


10* 


P 


10 


10 


10 


10 


908 


87 


12 


1 


P* 


























P 


9 


8 


7 


5 


7579 


1903 


2 





P* 


7 


6 


6 


5 














<? 


6 


5 


4 


3 


4545 


585 









Only p^ works well. The approximations that are not guaranteed to be valid 
{Pi q) require excessively large n before the false detection rate drops to 
acceptable levels. The cases n = 10^ and n = 10'' (even the largest of which was 
still too small for p to reach the Bonferonni target of zero false rejections) are 
computationally burdensome, and the situation only worsens as N increases. 
Furthermore, in a real problem the investigator has no way of determining that 
n is large enough. The confidence limit procedure similarly failed because the 
estimated standard errors are too variable. 

The valid p- values (p*, %) ensure that the Bonferonni correction works re- 
gardless of n, but > l/(n-|- 1), so it is not useful after a Bonferonni correction 
unless n is extremely large. The good performance of requires the combina- 
tion of importance sampling, which allows tiny p-values to be approximated 
with small n, and the validity correction introduced in this paper, which allows 
multiple-testing adjustments to work properly. 
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4.2. Exact inference for covariate effects in Rasch models 

Let X = {Xij : i = 1,...,M; j = 1,...,N) he a. binary M x N matrix. 
Consider the following logistic regression model for X. The Xij are independent 
Bernoulli(pij) where 

log ^K + a, + + 0v,j (4) 

t Pij 

for unknown coefficients k, cy. ~ (ai, . . . , a^f ), and f3 = . . . , /Sat), and known 
covariates v = [vij : i — 1, . . . , M; j — 1, . . . , N). The special case = is called 
the Rasch model and is commonly used to model the response of M subjects to 
N binary questions (Rasch, 1961). In this example, we discuss inference about 
9 when k, ck, (3 are treated as nuisance parameters. 

Consider first the problem of testing the null hypothesis oi 9 = 9' versus the 
alternative of ^ ^ 9' . Conditioning on the row and column sums removes the 
nuisance parameters, and the original composite null hypothesis reduces to the 
simple (conditional) null hypothesis oi X ^ Pqi, where Pg> is the conditional 
distribution of X given the row and column sums for the model in (4) with 
K = ai = ■ ■ ■ — aM = Pi = ■ • • = Pn = Q and 9 = 9' . A sensible test statistic is 
the minimal sufRcient statistic for 9, namely, 

^(•^) ~ y ] XjjVjj 

Since t{X) has power for detecting 9 > 9' and -~t{X) has power for detecting 
9 < 9' , it is common to combine upper- and lower-tailed p- values into a single 
p- value, p^{9',X), defined by 

p+{9,x)^PTg{tiX)>t{x)) 
p-{9,x)^PTg{-t{X)>-t{x)) 

p^{9,x) = min{l,2min{p+(6l,a;),p-(6',x)}} (5) 

where Prg uses X ^ Pg. There are no practical algorithms for computing 
p^{9',X) nor for direct sampling from Pgr. The importance sampling algorithm 
in Chen et al. (2005) (designed for the case 9 = 0) can be modified, however, 
to create a sensible and practical proposal distribution, say Qg , for any Pg . The 
details of these proposal distributions wih be reported elsewhere. Qg{x) can be 
evaluated, but Pg{x) is known only up to a normalization constant, so we must 
use p and p*. Here we compute two Monte Carlo p- value approximations, one 
for each of t and —t, and combine them as in (5). Validity of the underlying 
p- values ensures validity the combined p- value. 

Confidence intervals for 9 can be constructed in the usual way by inverting 
this family of tests. In particular, a 1 — a confidence set for 9 is the set of 
9' for which we do not reject the null hypothesis oi 9 = 9' at level a. An 
annoying feature of inverting Monte Carlo hypothesis tests is that in most cases 
a new sample is needed for each 9', which increases the computational burden 
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and creates pathologically shaped confidence sets, because the Monte Carlo 
randomness varies across 9' (Bolviken & Skovlund, 1996). Importance sampling 
suffers from the same problems, unless the same importance samples, and hence 
the same proposal distribution, are used for testing each 9' . 

To create a common proposal distribution that might work well in practice, 
we use a mixture of proposals designed for specific 9. In particular, define 



L 

where [0i,...,9l) are fixed and each Qe^ is a proposal distribution appro- 
priate for Pqi, as mentioned earlier. Here we use L = 601 and {9i, . . . ,9^) = 
(—6.00, —5.98, —5.96, . . . , 6.00). For any 9, the importance weights are 

fa ^ A Pe{x) ej,p{9t{x)) 
Q{x) iZtQeAx) 

for any binary matrix x with the correct row and column sums, where cg is an un- 
known constant that is not needed for computing p and p*. We will use p^{9, z) 
and p~ {9, z) to denote p{z) when computed using the importance weights w{9,-) 
and the test statistics +t and —t, respectively, where z = {x,yi, . . . , Un)- Sim- 
ilarly, define p^{9,z) and p^{9,z). The upper and lower p- values can be com- 
bined via 

p^ = min{ 1 , 2 min{p^ iP }} ^-nd p^ = min| 1 , 2 min{p^ ? pC} } 

In light of Theorem 2, pf{9', Z) is a valid p- value for testing the null hypothesis 
that 9 = 9'. 

Inverting these p-values gives the confidence sets 

C^iz) ^{9eR: p^{9,z) > a} 
C^iz)^{9eR:pt{9,z)>a} 

For fixed z, the approximate p-values are well-behaved functions of 9, so it 
is straightforward to numerically approximate the confidence sets (which will 
typically be intervals). 

Table 2 describes the results of a simulation experiment that investigated 
the coverage properties of C" and C". In that experiment, we took M = 200, 

= 10, and fixed the parameters k, 9, a, /3 and the covariates v (see the Ap- 
pendix for details). We set 6* = 2 and repeatedly (1000 times) generated datasets 
and importance samples in order to estimate the true coverage probabilities and 
the median interval lengths of various confidence intervals. Confidence intervals 
based on the corrected p-values always maintain a coverage probability of at 
least 1 — a without becoming excessively long, while those based on uncorrected 
p-values can have much lower coverage probabilities. As the number of impor- 
tance samples increases, the two confidence intervals begin to agree. 
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Table 2 

Performance of 95% confidence intervals versus Monte Carlo sample size (n). 
coverage prob (%) median length 



n 


10 


50 


100 


10 


50 


100 


P 

P* 


27.1 

99.3 


77.9 
98.0 


87.9 
96.9 


0.36 
2.10 


1.00 
1.52 


1.10 
1.38 



Incidentally, importance sampling seems to be the only currently available 
method for quickly creating (even approximate) frequentist confidence intervals 
using conditional inference in this model for a dataset of this size. Exact condi- 
tional logistic regression methods (e.g., StataCorp (2009) or Cytcl (2010)) are 
unable to enumerate the state space or generate direct samples with a few gi- 
gabytes of memory. And the corresponding MCMC methods (e.g., Zamar ct al. 
(2007)) often fail to adequately sample the state space, even after many hours 
of computing. Classical, unconditional, asymptotic confidence intervals also be- 
have poorly. For the simulation described in Table 2, 95% Wald intervals only 
had 82.2% coverage probability with a median length of 1.24. 

4-3. Accelerated testing for neurophy Biological data 

The example in this section is similar to the example in Section 4.1. It is mo- 
tivated by a practical multiple testing problem that arises in neurophysiology. 
Neurophysiologists can routinely record the simultaneous electrical activity of 
over 100 individual neurons in a behaving animal's brain. Each neuron's spiking 
electrical activity can be modeled as a temporal point process. Neurophysiol- 
ogists are interested in detecting various types of spatio-temporal correlations 
between these point processes, an endeavor which often leads to challenging 
multiple testing problems. For example, simply testing for non-zero correlation 
between all pairs of 100 neurons gives (^2") different tests, and neurophysiolo- 
gists are often interested in much more complicated situations than this. This 
section demonstrates how to use importance sampling to accelerate a Monte 
Carlo hypothesis test for lagged correlation used in Fujisawa ct al. (2008), here- 
after FAHB, and then to correctly control for multiple tests via the p- value 
correction. 

Let X = {T,N) be a marked point process where T = {Ti, . . . ,Tm) are 
the nondecreasing event times taking values in {0,1, . . . , B — 1} and N = 
(iVi, . . . , Nm) are marks taking values in {1, ... , C}. Let = (T/, ...,TIjJ = 
(Tfe : Nk = i) be the sequence of event times with mark i, which in this context 
are the firing times (rounded to the nearest millisecond) of neuron i. A neuron 
can fire at most once per ms, so the times in each are strictly increasing. 
The dataset used here is described in FAHB and has C = 117 neurons and was 
recorded over a period oi B = 2.812 x 10^ ms (about 47 minutes) from a rat 
performing a working memory task in a simple maze. There are M = 763501 
total events. Fix A = 10 ms. For each ordered pair of neurons (i,j), i j, 
FAHB were interested in testing the null hypothesis that the joint conditional 
distribution of and was uniform given [T^/AJ and [T-'/AJ, where [a\ is 
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the greatest integer < a. Note that [rVAJ = ([T*/AJ, . . . , [TIjJA]) gives a 
temporally coarsened version of neuron i's firing times — it only preserves the 
total number of firings in each length-A window of a fixed partition of time. 
FAHB used two different test statistics t+{T\T^) and t-{T\T^) defined by 

t+{T\T^) ^ ^max^^^ - U ^ d} 

k=ii=i 

Mi Mj 

t- {T\T') ^ ^ min ^ ^ ^ t{Tl - ^ d} 
k=i 1=1 

each with power to detect different alternatives that they hoped to distinguish. 
(There are some minor technical differences between the tests described here and 
those reported in FAHB.) Loosely speaking, the tests were designed to detect a 
particular type of transient, fast-temporal, lagged-correlation between neurons 
(via the test statistic) while ignoring all types of slow-temporal correlations (via 
the conditioning). Additional motivation, references and details can be found in 
FAHB. 

Testing a specific null and test statistic combination is easy: it is trivial to 
sample from the null (i.e., conditionally uniform) distribution and to Monte 
Carlo approximate p-values for any test statistic. Unfortunately, in this case 
there are 2C(C — 1) = 27144 different hypothesis tests, which leads to a chal- 
lenging multiple testing problem. Much like the example in Section 4.1, im- 
practically large Monte Carlo sample sizes are needed to successfully use direct 
sampling (i.e., Q = P) in conjunction with a multiple testing correction such 
as Bonferroni. FAHB did not directly address this particular multiple testing 
problem, but they did use overly conservative individual p-values in hopes of 
reducing the severity of the problem. 

Here we observe that importance sampling in conjunction with p.^, permits 
proper and practical use of multiple testing corrections. The importance sam- 
pling proposal distributions that we use are mixture distributions much like the 
one in Section 4.2 (see the Appendix for details). We use a different proposal 
distribution for each hypothesis test depending on the choice of test statistic and 
on the neuron pair. Using n — 100 and rejecting whenever < 0.05/ (2C(C— 1)) 
controls the family-wise error rate (FWER) at level 0.05. 

FAHB reported 78 rejections. We can confirm about one third of them using 
this much more stringent FWER procedure. The qualitative results of FAHB 
remain the same using this smaller set of rejections, reassuring us that the 
scientific conclusions of FAHB are not the result of statistical artifacts caused 
by improper control of multiple hypothesis tests. 

5. Discussion 

The practical benefits of using either or p.^, are clear. Valid p-values are 
always crucial for multiple testing adjustments. But even for individual tests, 
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the p-value corrections protect against false rejections resulting from high (and 
perhaps undiagnosed) variability in the importance sampling approximations. 
There is almost no computational penalty for using the corrections. And there 
is little or no loss of power for well-behaved importance sampling algorithms. 
These advantages extend to confidence intervals constructed by inverting the 
corrected p- values. 

Monte Carlo approximations should always be accompanied by further ap- 
proximations of the Monte Carlo standard error. Approximate standard errors 
give useful diagnostic information, but they can be very misleading. The poor 
performance of q in Section 4.1 shows that the uncorrected p-value approxi- 
mations cannot be meaningfully corrected by relying on approximate standard 
errors. An interesting issue is whether or not the p-value corrections should also 
be accompanied by approximate standard errors. Further research is warranted 
on this issue, but at this time, it seems sensible to report both the uncorrected p- 
value approximations with their approximate standard errors and the corrected 
p- values with no standard errors. This provides useful information about esti- 
mating the target p-value and about Monte Carlo variability, but also permits 
interpretable hypothesis testing and multiple testing adjustments, without con- 
fusing the two issues. One danger that must be avoided is the temptation to use 
close agreement between the corrected and uncorrected p- values as an indicator 
of convergence. The Appendix contains additional examples that demonstrate 
how all four p-value approximations (and their standard errors) can be in close 
agreement but still far from the truth (toward which they are converging). 

Although the p-value corrections are extremely simple, they are not always 
available using off-the-shelf importance sampling software, because the software 
may not permit evaluation of the importance weight of the original data. Adding 
this capability is almost always trivial for the software designer — instead of 
actually calling random number generators, simply see what values would lead 
to the original data and evaluate their probability — but investigators may be 
unwilling or unable to modify existing software. Hopefully, software designers 
will begin to include this capability in all importance sampling algorithms. We 
have found that the generic ability to evaluate a proposal distribution at any 
point (including the observed data) is quite useful for diagnostic purposes, even 
if one does not expect to make use of the p-value corrections. 

The techniques described here are examples of a more general principle of 
combining observations from both target and proposal distributions in order 
to improve Monte Carlo importance sampling approximations. Suppose, for ex- 
ample, that our computational budget would allow a small sample from the 
target and a large sample from the proposal. What is the best way to combine 
these samples for various types of inference? As we have shown, even a single 
observation from the target can be used to provide much more than diagnostic 
information. It can be used, surprisingly, to ensure that (often extremely poor) 
hypothesis test and confidence interval approximations maintain the nominal 
significance levels. Perhaps a single observation from the target could also be 
used to improve approximations of point estimators in some way? 

As importance sampling continues to gain prominence as a tool for Monte 
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Carlo approximation, so do the innovations in importance sampling techniques. 
Most of these will not fit into the classical importance sampling framework 
needed for the theorems here. It remains future work to investigate the degree 
to which the original data can inform and correct more sophisticated sequential 
Monte Carlo methods. But it seems clear that this valuable source of diagnostic 
information should not be completely ignored. 

Appendix A: Proofs 

The proofs of both Theorem 1 and Theorem 2 rely on a simple algebraic fact, 
encapsulated in the next lemma. 

Lemma 3. For all tg, ■ ■ • , Ui G [~oo, oo] and all a,wo, ■ ■ ■ ,Wn G [0, oo], we have 

n f n \ 

^ «;fel < 51 >tk}<a \ <a 

fe=0 li=0 J 

Proof. Let H denote the left side of the desired inequality. We can assume that 
H > 0, since the statement is trivial otherwise. The pairs (to, wq), . . . , (t„, w„) 
can be reordered without affecting the value of _ff , so we can assume that to > 
■ ■ ■ >tn. This implies that X]"=o ^i-'^i^* — increasing in k, and that there 

exists a k* defined as the largest k for which 



Y,w^l{U > tk} < 

So 



a 

i=0 



k* k* 



k=0 i=0 

A.l. Proof of Theorem 1 

For any sequence y = {yo, . . . ,yn) and any k e {0, . . . ,n}, let 

y'' = {yk,yi, ■ ■ . ,yk~i,yo,yk+i, ■■■,yn) 

which is the sequence obtained by swapping the 0th element and the fcth element 
in the original sequence y. We begin with a lemma. 

Lemma 4. For any nonnegative, measurable function f , 

\ k=Q J 
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Proof. Recall that Zi = ^ Q for i 7^ and that Zq — X ^ P. A change of 
variables from X to Yq gives 

E(/(Z)) - E{w{Yo)f{Y)) 

Since the distribution of Y is invariant to permutations, we have 

E{w{Yo)fiY)) = E{w{Yk)f{Y'^)) 

for each fc = 0, . . . , n, which means that 

1 " 

E{f{Z))=E{w{Yo)f{Y)) = Y.H^(Y,)f{Y'')) 

k=a 

Moving the sum inside the expectation completes the proof. □ 
Applying Lemma 4 to the function f{z)^ t{p^{z) < a] gives 



Pr(p,(Z) <a) = 



E(l{p,(Z) < a}) = E 1^-^ < a) 



The quantity inside the final expectation is always < a, which follows from 
Lemma 3 by taking te = t{Ye, Y) and wg = w{Ye) / {n + 1) for each £ — 0, . . . ,n. 
This completes the proof of Theorem 1. 

A. 2. Proof of Theorem 2 

Let n denote a random permutation chosen uniformly from M and chosen 
independently of Z. Let U = Z^'^"> . If tt G is a fixed permutation, then 11 
and n*^'^) have the same distribution, which means U and f/^'^^ have the same 
distribution. In particular, U and TJ^ have the same distribution, where the 
notation is defined in the proof of Theorem 1. We begin with two lemmas, 
and then the proof of Theorem 2 is nearly identical to that of Theorem 1. We 
use the convention 0/0 = 0. 

Lemma 5. Let Pz and Pu denote the distributions of Z and U , respectively, 
over 5""+^. Then Pz <C Pu o,nd 

dPz _ (n + 1)w(mo) 
dPu EjU^("j) 

almost surely. 
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Proof. Let g be any nonnegative, measurable function on S"'^^. We need only 
show that 

For any tt G there is a unique inverse permutation 7r~^ G M with = 
TT^-i = j, for each i = 0, . . . ,n. Comparing with the proof of Lemma 4, for any 
nonnegative, measurable function / we have 

E(/(Z(-))) = E(/(rW)u;(ro)) = E{fiY)w{Y^-.)) (7) 

where the first equality is a change of variables from Z to Y and the second 
equality follows from the fact that the distribution of Y is permutation invariant, 
so, in particular, Y and Y^'^ ^ have the same distribution. Using (7) gives 

= E E(/(t/)|n = tt) = E E(/(ZW)) 

1 1 " 

^ ^ TreA^ ^ ' j=0 neM: 



^0 =1 

1^ E E(/(Y)..(y,)) = E (^1^/0^)) (8) 
Applying (8) to the function f{u) = {n + l)g{u)w{ua)/ {J2^=o ^("i)) gives 



(n + 1)! 



= E{wiYa)giY)) = E(.g(Z)) 

where the last equality is a change of variables as in (7). This gives (6) and 
completes the proof. □ 

Lemma 6. For any nonnegative, measurable function f , 

Proof. Changing variables from Z to U and using Lemma 5 gives 

E(/(^))-Ef^±ll^/([/)) 
Since the distribution of U is invariant to permutations, we have 
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for each k — 0, . . . ,n, which means that 

Moving the sum inside the expectation and cancelhng the (n + l)'s completes 
the proof. □ 

Applying Lemma 6 to the function f{z) — t{p^{z) < a] gives 

Pr{p4Z) < a) = E{t{MZ) < a}) - E (j^ ^n^^'l^M MU") < a}] 

\k=0 ^j=o^^'^i> ) 

^ ^ {S e£!^'<'<^"^' ^ '<^"^" ^ "}) 

The quantity inside the final expectation is always < a, which follows from 
Lemma 3 by taking = t{Ue,U) and Wi = w(J7£)/(^"^q w(C/j)) for each 
£ — 0, . . . ,n. This completes the proof of Theorem 2. 



Appendix B: Proposal distributions and simulation details 

The simulation example in Section 4.1 uses conditional inference, so the target 
and proposal distributions for each dataset i (notation suppressed) can depend 
on the observed values V = {Vi, . . . , Vm) and the fact that there are r one-labels 
and m — r zero-labels (but cannot depend on the observed pairing of labels and 
values). Let I — (/i, . . . , J^) be a permutation that makes V non-increasing, i.e., 
Vii > ■ ■ ■ > Vj^. Choose a random permutation 11 = (Ei, . . . , E^) according 
to the distribution 

^ ' H(™-r)!ELo(I.)(r.')exp(^fc) ^ ' ' ^ 

where the binomial coefficients (^) = a!/(6!(a — b)l) are defined to be zero if 
a < 0, 6 < 0, or a < 6. Leaving V in the original observed order and permuting 
L so that i/nj — ■ ■ ■ = Lj-^^ ~ 1 and Lj^ = . . . = Lj-^^^^ = gives a random 
pairing of values with labels. The case = is the uniform distribution over 
permutations, i.e., the target null conditional distribution. We used = 3 for 
the proposal distribution which assigns higher probability to those permutations 
that tend to match the label one with larger values. 

The simulation example in Section 4.3 also uses conditional inference, so the 
target and proposal distributions for each pair of neurons (i,j) can depend on 
the temporally coarsened event times [TYAJ and [T-'/AJ . Define the set of all 
event times with the same temporal coarsening as neuron i to be 

^ {m e {0, . . . , B - 1}^^' : < • • • < Lwfc/AJ = LT^VAJ ,Vfc} 
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For any s e and a G Z define 

Ra{s)^\{k: lsk/A\ ^a}\ 

so that, for example, 

Tlie target distribution (i.e., the null hypothesis for pair (i,j)) is the uniform 
distribution over il, x ilj. For any s E flj, u E fli, d E Z, and e M define 

which is a probability distribution over flj for fixed u, d, and 0. The case = 
is the uniform distribution over il^; the case 9 > Q prefers those s with event 
times that match those in it + d; and the case 6 <Q prefers those s that do not. 
We used proposal distributions of the form 

Pr(r ^u,T^=s)^\j2 ^^^\^p{s. u, d, 9,) 

which choose event times uniformly for neuron i and then choose times in neuron 
j that prefer (or avoid) times of a specific lag from those of neuron i. For 
the test statistic i+ we used {6q, . . . ,94) = (0, .5, .5, .5, .5) and for we used 
(0,-.5,-.5,-.5,-.5). 

All simulations and analysis were performed with custom software written 
in Matlab 2010b and executed on a 2.66 GHz iMac with 8GB of RAM using 
Matlab's default pseudo random number generator (Mersenne Twister). The pa- 
rameters used for the logistic regression example in Section 4.2 are k = —1.628, 
f3 = (0.210, -0.066, 0.576, -0.197, 0.231, 0.184, -0.034, -0.279, -0.396, 0.000) and 
a = (-0.183, -0.735, -0.144, -0.756, -0.749, -0.226, -0.538, -0.213, -0.118, -0.284, 
-0.127, -0.632, -0.132, 0.104, 0.000, -0.781, -0.500, -0.498, -0.182, -0.269, -0.077, 
-0.499, -0.661, -0.780, -0.095, -0.661, -0.478, -0.315, -0.638, -0.225, -0.382, -0.715, 
-0.085, -0.766, -0.573, -0.629, -0.336, -0.775, -0.461, -0.762, -0.754, -0.082, -0.575, 
-0.263, 0.098, -0.434, -0.172, -0.109, -0.434, -0.211, -0.757, 0.067, -0.679, -0.601, 
-0.069, -0.379, -0.098, -0.471, -0.594, -0.830, -0.193, -0.437, -0.415, -0.257, -0.807, 
-0.551, -0.094, -0.170, -0.741, -0.737, -0.774, -0.859, -0.444, -0.211, -0.144, -0.336, 
-0.758, -0.235, -0.740, -0.732, -0.768, -0.725, -0.698, -0.671, -0.549, -0.550, -0.649, 
-0.616, 0.026, -0.164, -0.311, -0.682, -0.655, -0.789, 0.047, -0.160, -0.309, -0.553, 
-0.701, -0.244, 0.121, -0.696, -0.609, -0.470, -0.793, -0.183, -0.464, 0.116, -0.465, 
-0.246, -0.712, -0.485, -0.706, -0.109, 0.004, -0.516, -0.181, -0.573, -0.336, -0.034, 
-0.269, -0.531, -0.568, -0.414, -0.444, -0.507, -0.308, -0.124, -0.442, -0.437, -0.742, 
-0.842, -0.577, -0.549, -0.213, 0.090, 0.069, -0.409, -0.626, -0.103, -0.107, -0.126, 
-0.123, -0.761, -0.185, -0.403, -0.655, -0.768, -0.043, -0.692, -0.703, -0.201, 0.028, 
-0.350, -0.164, -0.713, 0.087, -0.326, -0.187, -0.830, -0.058, -0.118, -0.747, -0.342, 
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-0.541, -0.320, -0.468, -0.452, -0.686, -0.611, -0.846, 0.057, -0.213, 0.066, -0.703, 
0.054, -0.072, -0.289, -0.427, -0.609, -0.115, -0.638, -0.803, -0.099, -0.196, -0.152, 
-0.225, -0.448, -0.476, -0.051, -0.549, -0.052, -0.078, -0.014, -0.361, -0.231, 0.084, 
-0.423, -0.807, 0.000). The final entries of f3 and a were constrained to be zero 
to avoid identifiability problems. The covariates u are recorded as a 200 x 10 
matrix in the ASCII file nu. csv. 

Appendix C: Additional examples 
C.l. Mismatched Gaussians 

For the first example, we take P to be Normal(0, 1), Q to be Normal(/z, cr), 
and use the test statistic t{x) = x. We experiment with a few choices of /x, a 
and n, and in each case compute the cumulative distribution functions (cdfs) 
and mean-squared errors (MSEs) of the approximations p, p, % , and of the 
true p- value p{X) = 1 — ^{X), where $ is the standard normal cdf. Actually, 
we do not compute these quantities directly, but approximate them using 10^ 
Monte Carlo samples (each of which uses a new X and new importance samples 
Yi, . . . ,Yn). Figure 1 and Table 3 show the results. In each case, % and seem 
to be valid p- values, confirming the theory. There are many cases where p and 
p are not valid. 

The MSEs of all the estimators seem comparable, with none uniformly better 
than the rest, although for a majority of these examples p has the smallest 
MSE. (We use min{p, 1} and min{p*, 1} instead of p and p*, respectively, for 
computing MSE.) An important special case among these examples is n = 0, 
a = 0.2, for which p and p have MSEs up to 10 times smaller than the MSEs 
of and p*, respectively. For this special case, none of the estimators does a 
good job approximating p, even with n = 10^, but and p* are conservative, 
whereas p and p are strongly liberal (especially for small p- values). In many 
hypothesis testing contexts, the conservative choice is desirable, despite worse 
MSE. 

C.2. Conditional testing in binary tables 

The next example was inspired by Bezakova et al. (2006) which describes a 
situation where an importance sampling algorithm taken from the literature 
converges extremely slowly. Let AT be a binary 52 x 102 matrix with row sums 
(51, 1, 1, . . . , 1) and column sums (1, 1,1,..., 1). The null hypothesis is that X 
came from the uniform distribution over the class of all binary matrices with 
these row and column sums. The alternative hypothesis is that X came from 
a non- uniform distribution that prefers configurations where the 51 ones in the 
first row tend to clump near the later columns. For any binary matrix x with 
these row and column sums, let £i{x) < ■ ■ ■ < i^iix) denote the indices of the 
columns for which x has ones in the first row. For example, if the first row is 
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A cumulative distribution functions B cumulative distribution functions 

for standard p-values for modified p-values 

H = 0. a = 0.2 1.1 = 0, o . 1 n = 0, o = 5 ii = 0, o = 0.2 (i = 0, o - 1 n = 0, a = 5 




X 10 X 10 X 1 X 10 X 10 X 1 



Fig 1. Estimated cumulative distribution functions (cdfs) for the approximate p-values in 
Example C. 1 under the null hypothesis. Each plot corresponds to a specific choice of fi and 
a for the proposal distribution. The plots in panel A show the cdfs for p (solid lines) and p 
(dashed lines) for each of n = 10,1000,100000, for a total of six cdfs on each graph. The 
plots in panel B show pt (solid lines) and pt (dashed lines) for each of n = 10,1000,100000. 
(Each cdf was estimated using 10® Monte Carlo experiments, each of which involved a new 
X and new Yi, . . . ,Yn. Since cdfs take values in [0,1], the maximum possible variance of 
any of these estimators is (1/4)10~®, which implies that an interval of H0~^ around any 
estimate is at least a 95% confidence interval.) The cdf of the target p-value, p, is the cdf 
of a UniformiO, 1) random variable, namely, F{x) = x, and is shown with a dotted line 
along the diagonal of each plot. The different choices of n are not labeled on the plots, but 
where the curves are distinguishable, larger n will always be closer to the diagonal, since the 
approximate p-values always converge to the target p-value. A valid p-value has a cdf that 
lies on or below the diagonal. If the cdf drops strictly below the diagonal, this indicates a loss 
of power. If the cdf exceeds the diagonal, this indicates that the resulting hypothesis test does 
not correctly control the type I error. Note that the corrected p-values are always valid, but 
that the original p-values are often invalid. 



(0, 1, 0, 1, 0, . . . , 1), then (^i, . . . , 4i) = (2, 4, 6, ... , 102). A suitable test statistic 
for distinguishing the null from the alternative is 

51 

Suppose that we observe a matrix X with (^i(X), . . . ,^5i(X)) ~ (1, 7, 8, 10, 
11, 15, 16, 17, 20, 21, 28, 29, 30, 36, 37, 40, 41, 42, 48, 49, 51, 54, 55, 56, 57, 58, 
60, 61, 62, 63, 65, 67, 68, 69, 70, 73, 75, 77, 80, 81, 82, 85, 86, 87, 91, 92, 94, 95, 
96, 97, 100), so that t{X) = 2813. What is the p-value p{X)7 

Chen ct al. (2005) suggest a proposal distribution for approximate uniform 
generation of binary matrices with specified row and column sums. (They ac- 
tually suggest a variety of proposal distributions. We use the same algorithm 
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Table 3 

Estimated MSEs of the approximate p-values (smallest for each row is bold) 



Q MSE 





a 


n 


p 




P* 




P 




P* 







0.2 


10 


1.16e - 


01 


2.54e - 


01 


5.12e 


- 02 


2.37e - 


01 





0.2 


1000 


5.98e - 


02 


1.98e - 


01 


2.43e 


- 02 


1.75e - 


01 





0.2 


100000 


3.39e - 


02 


1.52e - 


01 


1.41e 


- 02 


1.33e - 


01 





1 


10 


1.67e - 


02 


1.66e - 


02 


1.67e 


- 02 


1.66e - 


02 





1 


1000 


1.67e - 


04 


1.67e - 


04 


1.67e 


- 04 


1.67e - 


04 





1 


100000 


1.66e - 


06 


1.66e - 


06 


1.66e 


- 06 


1.66e - 


06 





5 


10 


8.58e - 


02 


9.52e - 


02 


6.80e 


- 02 


6.37e - 


02 





5 


1000 


1.38e - 


03 


1.39e - 


03 


4.96e 


- 04 


4.96e - 


04 





5 


100000 


1.45e - 


05 


1.45e - 


05 


4.94e 


- 06 


4.94e - 


06 


3 


0.2 


10 


3.23e - 


01 


3.23e - 


01 


3.31e 


- 01 


3.31e - 


01 


3 


0.2 


1000 


3.17e - 


01 


3.15e - 


01 


3.26e 


- 01 


3.26e - 


01 


3 


0.2 


100000 


3.08e - 


01 


3.07e - 


01 


3.20e 


- 01 


3.20e - 


01 


3 


1 


10 


1.94e - 


01 


1.83e - 


01 


2.53e 


- 01 


2.56e - 


01 


3 


1 


1000 


2.48e - 


02 


2.04e - 


02 


5.18e 


- 02 


5.06e - 


02 


3 


1 


100000 


7.69e - 


04 


6.06e - 


04 


4.76e 


- 03 


4.65e - 


03 


3 


5 


10 


9.62e - 


02 


1.05e - 


01 


8.63e 


- 02 


7.92e - 


02 


3 


5 


1000 


1.62e - 


03 


1.63e - 


03 


5.97e 


- 04 


5.98e - 


04 


3 


5 


100000 


1.72e - 


05 


1.72e - 


05 


5.93e 


- 06 


5.93e - 


06 


-3 


0.2 


10 


3.34e - 


01 


3.42e - 


01 


3.30e 


- 01 


3.34e - 


01 


-3 


0.2 


1000 


3.33e - 


01 


3.49e - 


01 


3.26e 


- 01 


3.36e - 


01 


-3 


0.2 


100000 


3.32e - 


01 


3.57e - 


01 


3.20e 


- 01 


3.39e - 


01 


-3 


1 


10 


2.80e - 


01 


3.99e - 


01 


2.53e 


- 01 


3.20e - 


01 


-3 


1 


1000 


1.05e - 


01 


2.68e - 


01 


5.17e 


- 02 


1.44e - 


01 


-3 


1 


100000 


1.62e - 


02 


3.22e - 


02 


4.79e 


- 03 


1.21e - 


02 


-3 


5 


10 


1.06e - 


01 


1.23e - 


01 


8.60e 


- 02 


7.82e - 


02 


-3 


5 


1000 


1.79e - 


03 


l.Sle - 


03 


5.98e 


- 04 


5.98e - 


04 


-3 


5 


100000 


1.88e - 


05 


1.88e - 


05 


5.93e 


- 06 


5.93e - 


06 



that they used in their examples: samphng columns successively and using con- 
ditional Poisson sampling weights oi ri/{n ~ ri). Note that the basic and more 
delicate algorithms that they describe are identical for the example here. Note 
also that we could have used their algorithm to sample rows successively, in 
which case the sampling would be exactly uniform because of the symmetry 
for this particular example.) Applying their algorithm to this problem with 
n — 2 X 10^ samples from the proposal distribution gives approximate p-values 
and standard errors of p = 0.068 ± 0.013 and the correction = 0.068. (We 
also have p = 0.066 ± 0.014 and = 0.066, although normally they would not 
be available because the importance weights are known only up to a constant 
of proportionality. The exact weights are available in this particular example 
because of the special symmetry.) The estimated squared coefficient of varia- 
tion (i.e., the sample variance divided by the square of the sample mean) for 
the importance weights is cw^ = 20249, which is extremely large, but which 
suggests an effective sample size of around n/{l + cv'^) « 988 according to the 
heuristic in Kong ct al. (1994). While much smaller than 2 x 10^, a sample size of 
988 would usually ensure reasonable convergence using direct sampling and the 
estimated standard errors seem to agree with this heuristic. Furthermore, the 
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solid black line in Figure 2 shows how p changes with n. It seems to have stabi- 
lized by the end. Thus, we might be tempted to take p — 0.068 as a reasonable 
approximation of p{X) and report a "p-value" of 0.068. 



Convergence of Approximate p-values 




Monte Carlo sample size (n) 

Fig 2. Evolution of the approximated p-values with increasing Monte Carlo sample size (n) 
in Example C.2. The dashed black line shows (an excellent approximation of) the target p- 
value. All of the other lines in the plot will converge to this value as n oo, because they 
are all consistent estimators. The solid black line shows the uncorrected importance sampling 
p-value approximation, p. For a fixed value of the observed test statistic, t{X), this line does 
not depend on the choice of X. The thin gray lines show the corrected importance sampling 
p-value approximation, p«, for different choices of X, but each with the same value of the 
observed test statistic, t{X). Unlike p, p* does depend on the choice of X. The figure shows 
50 thin gray lines. The thick gray line is the average of 1000 such thin gray lines (the first 
50 of which are shown). The X's for these 1000 were chosen uniformly from the set of X's 
with the same test statistic. The same sequence of 2 X 10^ importance samples were used in 
all cases. 

It turns out that 0.068 is not a good approximation of p{X). The special 
symmetry in this particular example permits exact and efficient uniform sam- 
pling — each of the (5°^) choices for the first row is equally likely, and each 
of the remaining 51! choices for the remainder of the matrix is equally likely. 
The true p-value is very nearly p{X) — 0.107, as estimated by p using n = 10^ 
i.i.d. Monte Carlo samples from P. This is shown as the black dashed line in 
Figure 2. 

Despite the fact that p and p* are essentially identical for this example, and 
despite the fact that they do not give a good estimate of the target p-value, 
p{X), the suggestion here is that reporting as a p-value correctly preserves 
the interpretation of a p-value, whereas, reporting p as a p-value is misleading. 
If we were to repeat this test many times, generating new X's sampled from the 
null (uniform) distribution and new importance samples, we would find that p 
is much too liberal — there will be too many small p-values. The correction in 
p* does not have this problem. 
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Generating 2 x 10^ samples from Q is computationally demanding, so we 
do not have a simulation experiment to illustrate the assertion in the previous 
paragraph. We do, however, illustrate the idea with two less demanding com- 
putations. The first approximates cdfs of p and for n = 1000, much like the 
experiments in Example C.l. Results are shown in Figure 3. The uncorrected 
p-values are extremely liberal — rejecting when p < 0.05 results in a type I 
error-rate of 40%. The corrected p-values are conservative, but valid, correctly 
preserving the interpretation of a level a test. 



cumulative distribution functions 

1 1 — ^ — ^ — ^ — ^ — ^ — ^ — ^ — ^ — ^ — 




X 1 



Fig 3. Cumulative distribution functions under the null hypothesis for p (dashed line) andp^ 
(solid line) using n = 1000 in Example C.2. The cdfs are approximated with 10* Monte Carlo 
repetitions. The dotted line is the Uniform{0, 1) cdf. It is visually indistinguishable from the 
cdf of the target p-value, p{X), although not identical, because of the discreteness of the test 
statistic. 

Our second illustration uses the same 2 x 10^ importance samples as before, 
but simply changes the original observation X to a new one with the same value 
of t. For example, consider another observation of X with (^i(X), . . . , £51 (X)) = 
(1, 2, 6, 8, 10, 14, 16, 18, 19, 20, 21, 23, 25, 29, 32, 33, 36, 38, 42, 44, 46, 49, 
50, 53, 54, 57, 60, 62, 63, 67, 68, 69, 70, 72, 73, 76, 78, 81, 82, 88, 89, 92, 
93, 94, 95, 96, 97, 99, 100, 101, 102), which has the same value of the test 
statistic t{X) = 2813, and consequently the same target p-value. It also has the 
same approximate uncorrected p-value, p = 0.068 (using the same importance 
samples from before). But it does not have the same corrected p-value. In this 
case = 0.3907. The change results from the fact that this new observation of 
X is much rarer than the previous one under the importance sampling proposal 
distribution — so rare, in fact, that even n = 2 x 10'' is insufhcient to mitigate the 
effect that its importance weight has on the corrected p-value. The 50 different 
thin gray lines in Figure 2 show how varies with n using 50 different choices of 
X, all with t{X) = 2813. The thick gray line shows the average of 1000 different 
choices of X, all with t{X) = 2813. The new X's were chosen uniformly subject 
to the constraint that t{X) — 2813, which is actually quite efficient in this 
special case by using direct sampling from P, followed by rejection sampling. 
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By paying attention to X, and not just t{X), p^, can be a valid p-value, even 
though it is not a good estimator of p{X) using n « 10^. 

This class of importance sampling algorithms is used, for example, in testing 
goodness of fit in Rasch models and in the statistical analysis of ecological 
data. Chen et al. (2005) use an occurrence matrix for "Darwin's finch data" 
to illustrate their importance sampling approach for testing the uniformity of 
zero-one tables. The observed table, X, is a 13 x 17 binary matrix indicating 
which of 13 species of finch inhabit which of 17 islands in the Galapagos. The 
(ordered) row sums of X are (17, 14, 14, 13, 12, 11, 10, 10, 10, 6, 2, 2, 1) and the 
(ordered) column sums are (11, 10, 10, 10, 10, 9, 9, 9, 8, 8, 7, 4, 4, 4, 3, 3, 3). A 
scientifically relevant null hypothesis is that X was selected uniformly among all 
possible occurrence matrices with the same sequence of row and column sums. 
A scientifically relevant test statistic is 



where {A)ij denotes entry in matrix A, and where A"^ denotes the transpose 
of A. The test statistic should be large if there is competition among species. 
The observed data has t{X) — 53.1. See Chen ct al. (2005) for details and 
references. 

Chen et al. (2005) use the importance sampling algorithm described above to 
generate an approximate p-value for this hypothesis test. They report p- values 
of p = (4.0 ± 2.8) X 10"^ using n = 10'^ and p = (3.96 ± 0.36) x 10"^ using 
n — 10^. The error terms are approximate standard errors, estimated from 
the importance samples. Repeating their analyses, but with new importance 
samples, gives p = (7.77 ± 7.78) x 10"'' and = 13.92 x 10""* using n = lO'', 
andp = (4.32±0.51)x 10~* andp* = 4.38x10""* using n = 10^. An investigator 
reporting any of the values of p, even with estimates of standard errors, has few 
guarantees. But reporting either of the values of is guaranteed to preserve 
the usual interpretation of a p-value. 
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